## Contents

#### Insight 1: Passenger Numbers
#### Insight 2: Cash versus Credit
#### Insight 3: Fare Breakdown
#### Insight 4: Pick-up and Drop-off Locations
#### Insight 5: Average Fare by Day and Time
#### Insight 6: Busiest City Locations

## Summary

____
**Solutions to** the **bold questions** below are included in this notebook
____

###### Suggested Basic Questions:

1. What are the **distributions of the number of passengers per trip** (see Insight 1), **payment type, fare amount, tip amount, and total amount** (see Insights 2 & 3)?
2. What are the top 5 busiest hours of the day, and the **top 10 busiest locations of the city** (see Insight 6)?
3. What is the **hourly taxi activity for each day of the week** (see Insight 5)?
4. **Which trip has the most consistent fares** (see Insight 2)? Manhattan to JFK Airport (set fare of $52)

###### Suggested Open Questions:

1. Can you predict the fare and tip amount based on the pickup / drop-off location, time, and day of the week?
2. Can you predict the pickup / drop-off geographical distribution for each hour of a weekday?
3. **If you were a taxi owner, how would you maximize your earnings in a day?**
    * Work the early shift (the data show above-average fares from 3 am until 7 am)
4. **If you run a taxi company, how would you maximize your earnings?**
    * In short: more data needed! Uber is a major market disruptor in the taxi space. To maximise taxi company earnings, it is necessary to discover how old-school taxis can strategically adapt to thrive in current market conditions. Data needed to support the taxi company in maximising its earnings going forward could include:
        * Concurrent analysis of Uber versus taxi data
        * Trends within taxi data for the last 2-3 years
    * The data show that most taxis are hailed from busy streets (Insight 4). On cold NY winter mornings (or in the rain!), does Uber now take a big share of the historical taxi market, with pick-up direct from the door rather than a walk to a major route to hail a taxi? This gap was temporarily addressed by UberT (a yellow taxi could be requested to your door through the Uber app for a $2 surcharge to Uber; the service ended in Aug 2016).

---
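The flat-fare answer to basic question 4 can be checked by looking for a spike in the fare counts among longer trips. The following is a minimal plain-Python sketch on made-up fares, not on the TLC data; `most_common_fare` and `min_fare` are hypothetical names introduced here for illustration. In the notebook, the input would be the values of `df['fare_amount']`.

```python
from collections import Counter


def most_common_fare(fares, min_fare=0.0):
    """Return (fare, count, share) for the most frequent fare at or above min_fare.

    share is the fraction of the eligible (>= min_fare) trips that paid that fare.
    """
    eligible = [round(f, 2) for f in fares if f >= min_fare]
    if not eligible:
        raise ValueError("no fares at or above min_fare")
    fare, count = Counter(eligible).most_common(1)[0]
    return fare, count, count / len(eligible)


# Made-up fares for illustration only (assumption: not taken from the dataset)
sample_fares = [6.5, 9.0, 52.0, 12.5, 52.0, 33.0, 52.0, 41.5, 8.0, 52.0]

fare, count, share = most_common_fare(sample_fares, min_fare=30.0)
print(f"Most common fare >= $30: ${fare:.2f} ({count} trips, {share:.0%} of those trips)")
```

A single fare value that dominates the longer trips is what a flat rate would look like in the data; a metered route would spread across many nearby values instead.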
```python
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# Online plotting (plotly.plotly) moved to the separate chart_studio package in plotly 4+:
# import chart_studio.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.figure_factory as ff
import plotly.graph_objs as go
from IPython.display import Image, display, Math, Latex, HTML

# Initiate the Plotly notebook mode
init_notebook_mode()

df_big = pd.read_csv('../data/yellow_tripdata_2016-01.csv')
# df_big_clean = df_big.fillna(df_big.mean())
# df_big.dropna(axis=1)
df_big_clean = df_big

df = df_big_clean.loc[0:10000, :]  # use a reduced number of data points for testing
# df = df_big  # use the whole month of data
print(df_big.shape)
print(df_big_clean.shape)
df
```

## Insight 1: Passenger numbers

* Most NY taxi trips transport solo passengers

```python
# Extract the number of people per trip
peps_per_trip = df['passenger_count'].values

data = [go.Histogram(x=peps_per_trip)]  # or [dataset1, dataset2]
layout = go.Layout(
    title='Histogram of Passenger Numbers',
    xaxis=dict(title='Passenger number'),
    yaxis=dict(title='Count'),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig, filename='People_per_trip_histogram')  # online mode; limit of 50 plots/day on a community account
iplot(fig, filename='People_per_trip_histogram')  # offline mode; no limit
```

## Insight 2: Cash versus Credit

* New Yorkers prefer to pay by credit card (60:40 ratio in favour of credit card)
* Cash usage remains considerable at 40%. The cash option is a point of difference over competitor Uber.
* The distribution of fares is similar across cash and credit card payments (the median credit card fare is \$1 higher than the median cash fare)
* The peak at \$52 represents Manhattan -> JFK airport trips (these have a flat-rate fee of \$52; source: Wikipedia)
* NY taxi fares are cheap (compared to Melbourne!). The median fare is around \$10

```python
# Distribution: payment by type
# df = df_big  # uncomment to run on the whole dataset

# Extract fares by payment type
# 1=credit card, 2=cash, 3=no charge, 4=dispute, 5=unknown, 6=voided trip
fare_paymenttype1 = df.loc[df['payment_type'] == 1, 'fare_amount'].values  # credit card
fare_paymenttype2 = df.loc[df['payment_type'] == 2, 'fare_amount'].values  # cash
# fare_paymenttype4 = df.loc[df['payment_type'] == 4, 'fare_amount'].values  # dispute
fare_payments = np.append(fare_paymenttype1, fare_paymenttype2)

total_paymentstype1 = df.loc[df['payment_type'] == 1, 'total_amount'].values  # fare + tips + tolls
total_paymentstype2 = df.loc[df['payment_type'] == 2, 'total_amount'].values  # fare + tips + tolls
tip_amountstype1 = df.loc[df['payment_type'] == 1, 'tip_amount'].values  # tips only (credit card)
total_payments = np.append(total_paymentstype1, total_paymentstype2)

# Count the trips paid by each method
numberofCCpays = (df['payment_type'] == 1).sum()
numberofCashpays = (df['payment_type'] == 2).sum()
PcentofCCpays = np.round(numberofCCpays * 100 / (numberofCashpays + numberofCCpays), decimals=1)
PcentofCashpays = np.round(numberofCashpays * 100 / (numberofCashpays + numberofCCpays), decimals=1)

# Group data together
hist_data = [fare_paymenttype1, fare_paymenttype2]
find_median1 = np.median(fare_paymenttype1)
find_median2 = np.median(fare_paymenttype2)
group_labels = ['Credit card', 'Cash']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=1.0)
fig.layout.update({'title': 'Distribution of Fares'})
fig.layout.xaxis1.update({'title': '$ amounts'})

# Plot!
# py.iplot(fig, filename='Distplot with Multiple Datasets')  # online plot mode
iplot(fig, filename='Distplot with Multiple Datasets')  # offline mode

display(Math(r'\text{Percentage of credit card payments is } %s \text{%%}' % PcentofCCpays))
display(Math(r'\text{Median credit payment is \$} %s ' % find_median1))
display(Math(r'\text{Percentage of cash payments is } %s \text{%%}' % PcentofCashpays))
display(Math(r'\text{Median cash payment is \$} %s' % find_median2))
```

## Insight 3: Fare Breakdown

* Median tip (credit card data only) is 20% of the fare

```python
# Group data together
hist_data2 = [fare_payments, total_payments, tip_amountstype1]
group_labels2 = ['Fare', 'Total Charge', 'Tip Amount']

# Create distplot with custom bin_size
fig2 = ff.create_distplot(hist_data2, group_labels2, bin_size=[0.5, 0.5, 0.4])
fig2.layout.update({'title': 'Breakdown & Distribution of NY Taxi Fares'})
fig2.layout.xaxis1.update({'title': '$ amounts'})

# Plot!
# py.iplot(fig2, filename='Distplot with Multiple Datasets2')  # online plot option
iplot(fig2, filename='Distplot with Multiple Datasets2')  # offline plot option

find_mediantip = np.median(tip_amountstype1)
Med_tip_percentage = np.round(find_mediantip * 100 / find_median1, decimals=1)
display(Math(r'\text{Median tip payment (Credit card payment data only) is \$} %s ' % find_mediantip))
display(Math(r'\text{Median tip percentage (Credit card payment data only) is } %s \text{%%}' % Med_tip_percentage))
```

## Insight 4: Pick-up and Drop-off Locations

* Manhattan (central business zone) is the busiest area for taxi use
* Airports (LaGuardia and JFK) feature strongly in usage maps
    * Curiously, people are dropped off at the airports at very fixed locations, while pick-up locations are more diffuse
    * Is there a culture of people wandering out from the airport and hailing taxis from wherever? Are taxi ranks hard to locate? Is it a GPS issue, or are meters started on the move?
* People **start taxi journeys** most frequently:
    1. in Manhattan on the **main streets**
    2. on the **main arterial routes** within residential areas (Brooklyn, Queens)
    * The *Sex and the City* imagery of hailing taxis on demand from busy streets is backed up by the data. Interesting in times of Uber.
* People **end taxi journeys** most frequently:
    1. again in Manhattan, both on and off the main streets
    2. at very **diffuse locations** across residential areas (Brooklyn, Queens, The Bronx)
    * The Bronx is a frequent drop-off location, but rarely a pick-up location
    * An effect of green "boro taxis" since 2013? (Note, however, that boroughs where green taxis can be hailed include The Bronx, Queens and Brooklyn, yet the Bronx taxi pattern is notably different)

```python
# Map the pick-up locations
from matplotlib import rcParams

df = df_big
plt.style.use('ggplot')  # better styling
new_style = {'grid': False}  # grid off
matplotlib.rc('axes', **new_style)
rcParams['figure.figsize'] = (12, 12)  # size of figure
rcParams['figure.dpi'] = 250
P = df.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude', color='white',
            xlim=(-74.06, -73.77), ylim=(40.61, 40.91), s=.02, alpha=.3)
P.set_facecolor('black')  # background colour
# plt.show()
```

```python
# Map the drop-off locations
df = df_big
plt.style.use('ggplot')  # better styling
new_style = {'grid': False}  # grid off
matplotlib.rc('axes', **new_style)
rcParams['figure.figsize'] = (12, 12)  # size of figure
rcParams['figure.dpi'] = 250
# s is the marker size and alpha is the opacity
P = df.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude', color='white',
            xlim=(-74.06, -73.77), ylim=(40.61, 40.91), s=.02, alpha=.3)
P.set_facecolor('black')  # background colour
plt.show()
```

## Insight 5: Average fare by day and time

* Average fare is similar over weekdays
* Early birds catch the worm: taxi drivers operating from 3:00 am to 7:00 am earn above-average fares
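The early-shift claim comes down to comparing each hour's mean fare against the mean across all hours. Here is a small plain-Python sketch of that comparison, written so it runs without the dataset. The trips are made up, and `mean_fare_by_hour` and `hours_above_average` are illustrative names, not part of the notebook. With the notebook's DataFrame, the hourly means could also be obtained with `df.groupby('hour')['fare_amount'].mean()`.

```python
from collections import defaultdict
from datetime import datetime


def mean_fare_by_hour(trips):
    """trips: iterable of (pickup_datetime_string, fare_amount) pairs.

    Returns a dict mapping hour of day -> mean fare for that hour.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pickup, fare in trips:
        hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
        totals[hour] += fare
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}


def hours_above_average(hourly_means):
    """Return the hours whose mean fare exceeds the mean of the hourly means."""
    overall = sum(hourly_means.values()) / len(hourly_means)
    return [hour for hour, mean in hourly_means.items() if mean > overall]


# Made-up trips for illustration only (assumption: not taken from the dataset)
trips = [
    ("2016-01-04 04:10:00", 18.0),
    ("2016-01-04 04:40:00", 22.0),
    ("2016-01-04 09:05:00", 9.0),
    ("2016-01-04 09:30:00", 11.0),
    ("2016-01-04 18:20:00", 12.0),
]

hourly_means = mean_fare_by_hour(trips)
early_bird_hours = hours_above_average(hourly_means)
print(hourly_means)
print(early_bird_hours)
```

The reference line here is the mean of the hourly means rather than the mean over all trips, which matches how the notebook draws its "overall mean" line; busy hours therefore do not weigh more than quiet ones.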
```python
# Times of the day versus average fare
df = df_big  # renaming for test stage
print(df.shape)

# Make new columns in the dataframe with hour of day and day of the week
pickup_times = pd.to_datetime(df['tpep_pickup_datetime'], format='%Y-%m-%d %H:%M:%S')
df['hour'] = pickup_times.dt.hour
df['day'] = pickup_times.dt.dayofweek

# Find mean fare by hour of day
meanfare_byhour = []
for i in range(0, 24):
    fares_byhour = df.loc[df['hour'] == i, 'fare_amount'].values  # hourly fares
    meanfare_byhour.append(np.mean(fares_byhour))

# pandas dayofweek convention is 0: Mon, 1: Tue, 2: Wed, 3: Thu, 4: Fri, 5: Sat, 6: Sun
# Find mean fare by weekday
meanfare_byweekday = []
for i in range(0, 7):
    fare_byweekday = df.loc[df['day'] == i, 'fare_amount'].values  # weekday fares
    meanfare_byweekday.append(np.mean(fare_byweekday))

meanacrosshoursofday = np.mean(meanfare_byhour)

# Plot bar chart of mean fare by weekday (labels follow the dayofweek order above)
data = [go.Bar(
    x=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    y=meanfare_byweekday
)]
layout = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='group',
    title='Mean Fare by Weekday',
    yaxis=dict(title='$'),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='basic-barWeekday')

# Plot bar chart of mean fare by hour of day (one label per hour, 0:00 to 23:00)
hour_labels = ['%d:00' % h for h in range(24)]
traceBar1 = go.Bar(
    x=hour_labels,
    y=meanfare_byhour,
    name='hourly mean fare'
)
trace2 = go.Scatter(
    x=[hour_labels[0], hour_labels[-1]],
    y=[meanacrosshoursofday, meanacrosshoursofday],
    mode='lines',
    name='overall mean'
)
layout2 = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='group',
    title='Mean Fares by Hour',
    yaxis=dict(title='$'),
)

# Syntax note for two traces (bar and line) in one plot:
#   trace1 = go.Bar(...); trace2 = go.Scatter(...); data2 = [trace1, trace2]
# For a single trace, wrap it in square brackets: data2 = [go.Bar(...)]
data2 = [traceBar1, trace2]
fig2 = go.Figure(data=data2, layout=layout2)
iplot(fig2, filename='basic-barHour')
```

## Insight 6: Busiest City Locations

* Nine of the ten busiest pick-up locations are in Manhattan; the tenth is JFK airport
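The ranking behind this insight rests on snapping pick-up coordinates to a coarse grid and counting trips per grid cell. Below is a minimal plain-Python sketch of that idea on made-up coordinates. `busiest_cells`, `cell` and `top_n` are illustrative names, and the 0.02-degree cell size is an assumption that mirrors the rounding in the notebook's code, without its 0.005 offset.

```python
from collections import Counter


def busiest_cells(points, cell=0.02, top_n=3):
    """Count (lat, lon) points per grid cell.

    Returns the top_n cells as ((lat_centre, lon_centre), trips) pairs,
    busiest first. Cells are keyed by integer grid indices so that
    floating-point noise cannot split one cell into two keys.
    """
    counts = Counter(
        (round(lat / cell), round(lon / cell)) for lat, lon in points
    )
    return [
        ((round(i * cell, 6), round(j * cell, 6)), trips)
        for (i, j), trips in counts.most_common(top_n)
    ]


# Made-up pick-up points for illustration only (assumption: not from the dataset)
pickups = [
    (40.7580, -73.9855), (40.7612, -73.9790), (40.7555, -73.9812),  # one Midtown-like cell
    (40.6442, -73.7822), (40.6431, -73.7790),                       # one airport-like cell
    (40.8050, -73.9520),                                            # a lone trip elsewhere
]

top_cells = busiest_cells(pickups)
for (lat, lon), trips in top_cells:
    print(f"cell centred at ({lat}, {lon}): {trips} trips")
```

The cell size trades resolution for stability: smaller cells separate neighbouring landmarks but spread the counts thinner, so the "top 10" depends on the grid chosen.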
```python
# Top 10 busiest locations of the city
import reverse_geocoder as rg
from geopy.geocoders import Nominatim
import gmplot

Topnum = 10  # find the top number (Topnum) of busiest locations in the city
df = df_big

# Round the lat and long entries
Latitude_round = (np.round(df['pickup_latitude'].values / 2, decimals=2)) * 2 + 0.005  # round and recentre grid box
Longitude_round = (np.round(df['pickup_longitude'].values / 2, decimals=2)) * 2 + 0.005  # round and recentre grid box
df.loc[:, 'GridcodeLat'] = pd.Series(Latitude_round, index=df.index)  # add gridcode column to df
df.loc[:, 'GridcodeLon'] = pd.Series(Longitude_round, index=df.index)  # add gridcode column to df

# Find the 10 locations with the most common grid codes
mytable = df.groupby(['GridcodeLat', 'GridcodeLon']).size()
mytable.sort_values(inplace=True, ascending=False)
totaltrips = mytable.sum()
print('Total trips')
print(totaltrips)
Top10BusyPickupLocations = mytable.head(Topnum)
Top10BusyPickupLocations = Top10BusyPickupLocations.to_frame()

# Find values for the later pie chart of the top 10 busiest locations by percentage of trip pick-ups
num_trips = Top10BusyPickupLocations.values.ravel()
num_trip_perc = num_trips * 100 / totaltrips
othertrips = 100 - sum(num_trip_perc)
num_trip_perc = np.append(num_trip_perc, othertrips)

coordinates = Top10BusyPickupLocations.index.values.tolist()
marker_lats = np.array(coordinates)[:, 0]
marker_lngs = np.array(coordinates)[:, 1]

# Manual map location boundaries: center_lat, center_lng, zoom
gmap = gmplot.GoogleMapPlotter(40.75, -73.9, 11)
gmap.plot([40.85], [-73.95], 'cornflowerblue', edge_width=10)
gmap.heatmap(marker_lats, marker_lngs, threshold=5, radius=10, gradient=None, opacity=0.6, dissipating=True)
gmap.draw("mymap.html")
```

```html
%%html
<iframe src="mymap.html" width="1000"></iframe>
```

The map has issues opening in Jupyter because it needs an API key from Google. To be fixed. Meanwhile, open the `mymap.html` file from the directory.

```python
# Plot pie chart of the top 10 busiest locations
NYToplabels = ['Midtown, Manhattan', 'Penn Station, Manhattan', 'Grand Central Station, Manhattan',
               'Upper East Side, Manhattan', 'Lenox Hill, Manhattan', 'Lower Manhattan',
               "Hell's Kitchen, Manhattan", 'Upper West Side, Manhattan', 'East Village, Manhattan',
               'John F. Kennedy International Airport', 'All other areas']

# Add graph data
trace1 = {
    'labels': NYToplabels,
    'values': np.append(num_trips, totaltrips - sum(num_trips)),
    'type': 'pie',
    'name': 'Pick up',
    'domain': {'x': [0, 1], 'y': [.4, 1]},
    'hoverinfo': 'label+percent+name',
    'textinfo': 'none'
}
data = [trace1]
layout = go.Layout(title='Top Taxi Pick-up Locations')
fig = go.Figure(data=data, layout=layout)

# Plot!
iplot(fig)
```

```python
# Find addresses of the co-ordinates. Two ways of doing this are shown here.
# Addresses are very awkward to handle due to inconsistency between addresses,
# so Google Maps is used instead (implemented in the cells above).
results = rg.search(coordinates)  # default mode = 2; reverse geocode from lat and long to address
print(results)

# geopy 2.x requires a user_agent; the name used here is a placeholder
geolocator = Nominatim(user_agent='ny-taxi-notebook')
# locations = geolocator.reverse("40.755, -73.985")
for i in range(0, Topnum):
    location = geolocator.reverse(coordinates[i])
    PlaceNames = location.address.split(",")
    print([PlaceNames[-8], PlaceNames[-7], PlaceNames[-6]])
# To do: plot table or pie chart
```
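The second basic question in the summary also asks for the five busiest hours of the day, which is not among the bold (solved) items. One possible approach is to count pick-ups per hour and take the largest counts. The sketch below does this in plain Python on made-up timestamps; `busiest_hours` and `top_n` are hypothetical names. Once the notebook's `hour` column exists, `df['hour'].value_counts().head(5)` would give the equivalent ranking.

```python
from collections import Counter
from datetime import datetime


def busiest_hours(pickup_times, top_n=5):
    """Return the top_n (hour, trip_count) pairs, busiest hour first."""
    hours = Counter(
        datetime.strptime(t, "%Y-%m-%d %H:%M:%S").hour for t in pickup_times
    )
    return hours.most_common(top_n)


# Made-up pick-up timestamps for illustration only (assumption: not from the dataset)
pickup_times = [
    "2016-01-05 18:02:11", "2016-01-05 18:15:40", "2016-01-05 18:47:03",
    "2016-01-05 19:05:29", "2016-01-05 19:30:00",
    "2016-01-05 08:12:54",
]

top_hours = busiest_hours(pickup_times, top_n=2)
print(top_hours)
```

Counting trips rather than averaging fares is the point of difference from Insight 5: an hour can be busy without being lucrative per trip, and vice versa.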